feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128 #2887
Conversation
Codecov Report
❌ Your patch check has failed because the patch coverage (76.54%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #2887 +/- ##
==========================================
- Coverage 86.27% 86.22% -0.05%
==========================================
Files 715 720 +5
Lines 272064 272971 +907
==========================================
+ Hits 234712 235374 +662
- Misses 36828 37073 +245
Partials 524 524
ScalarValue::Utf8(v) | ScalarValue::LargeUtf8(v) => {
    Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes())))
}
ScalarValue::Utf8View(v) => Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes()))),
ScalarValue::Binary(v) | ScalarValue::LargeBinary(v) => {
    Ok(v.as_ref().map(|b| murmur3_32(b)))
}
Why do we support different types when handling a scalar versus handling an Array?
For the Array case, it looks like we only support Utf8, LargeUtf8 and Binary. I think these supported types should maybe be consistent, unless there's a good reason for them not to be? Same comment for fnv, sha1 and xxh.
Good catch, made the array path consistent with the scalar path by adding Utf8View and LargeBinary support to hash_array in all five udf files.
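For reference, the murmur3_32 helper invoked in the snippet above corresponds to MurmurHash3 (x86, 32-bit variant). A dependency-free sketch of that algorithm with seed 0, as an illustrative reimplementation rather than the PR's actual code:

```rust
/// MurmurHash3 x86 32-bit with seed 0. Illustrative sketch; the PR's
/// actual helper may differ in seed or structure.
fn murmur3_32(data: &[u8]) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h: u32 = 0; // seed

    let chunks = data.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        // Body: mix each little-endian 4-byte block into the state.
        let mut k = u32::from_le_bytes(chunk.try_into().unwrap());
        k = k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h = (h ^ k).rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }

    // Tail: up to 3 leftover bytes, mixed without the rotate-13 step.
    if !tail.is_empty() {
        let mut k: u32 = 0;
        for (i, &b) in tail.iter().enumerate() {
            k |= (b as u32) << (8 * i);
        }
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
    }

    // Finalization: xor in the length, then avalanche (fmix32).
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^ (h >> 16)
}
```

Because both the scalar and array paths reduce inputs to `&[u8]` before hashing, routing every supported variant through one such byte-level helper is what keeps the two paths consistent.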
Hey, all changes addressed and
Hey @albertlockett, I fixed the bad merge in assign.rs, could you please re-run CI? I ran the full suite before pushing, so it should finally work.
ace4972
Change Summary
Add seven hash functions to the OTAP query-engine, per #2834:
- md5(): MD5 digest, backed by DataFusion's built-in crypto::md5 UDF.
- sha1(): SHA-1 digest, implemented as a custom ScalarUDF using the sha1 crate (DataFusion has no SHA-1 equivalent).
- sha512(): SHA-512 digest, backed by DataFusion's built-in crypto::sha512 UDF.
- fnv(): FNV-1a 64-bit hash, implemented as a custom ScalarUDF.
- murmur3(): MurmurHash3 32-bit hash, implemented as a custom ScalarUDF.
- xxh3(): XXH3 64-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
- xxh128(): XXH3 128-bit hash, implemented as a custom ScalarUDF using xxhash-rust.

Files
- consts.rs, parser.rs: register all 7 functions as external functions with param_placeholders(1).
- pipeline/expr.rs: import DataFusion's md5(), sha512() and the new custom UDFs, add them to DataFusionFunctionDef::from_func_name with requires_dict_downcast: true.
- pipeline/functions/fnv.rs (new): custom FnvHashFunc, FNV-1a 64-bit, returns Int64.
- pipeline/functions/murmur3.rs (new): custom Murmur3HashFunc, MurmurHash3 32-bit, returns Int64.
- pipeline/functions/sha1.rs (new): custom Sha1Func using the sha1 crate, returns Binary.
- pipeline/functions/xxh3.rs (new): custom Xxh3Func using xxhash-rust, returns Int64.
- pipeline/functions/xxh128.rs (new): custom Xxh128Func using xxhash-rust, returns Binary (16 bytes big-endian).
- pipeline/functions.rs: module declarations and make_udf_function! registrations.
- crates/query-engine/Cargo.toml: add sha1 and xxhash-rust workspace dependencies.

What issue does this PR close?
How are these changes tested?
Two layers of tests, all in this PR:
- Unit tests for the custom UDFs (fnv, murmur3, sha1, xxh3, xxh128): verify scalar string input, binary input, and null handling.
- pipeline::assign end-to-end tests for every function, exercising both OPL and KQL parsers through the full pipeline. Binary-returning functions (md5, sha1, sha512, xxh128) are wrapped with encode(..., "hex") and the output hex string is asserted. Integer-returning functions (fnv, murmur3, xxh3) assert the Int64 value directly.

cargo xtask quick-check passes clean.

Are there any user-facing changes?
Yes, users of the transform processor / query-engine can now call these hash functions in OPL/KQL programs:
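The OPL/KQL call examples are elided above. As background on the simplest of the custom UDFs, fnv() computes FNV-1a 64-bit; a dependency-free sketch follows. This is illustrative only, and the bit-for-bit cast to i64 is an assumption about how a u64 hash would be fitted into the Int64 return type described in the summary, not confirmed against pipeline/functions/fnv.rs:

```rust
/// FNV-1a 64-bit: xor each byte into the state, then multiply by the
/// FNV prime. Illustrative sketch, not the PR's implementation.
fn fnv1a_64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    data.iter()
        .fold(OFFSET_BASIS, |h, &b| (h ^ b as u64).wrapping_mul(PRIME))
}

/// Assumed adaptation to an Int64 column: reinterpret the u64 as i64
/// (large hashes simply come out negative; no information is lost).
fn fnv_as_int64(data: &[u8]) -> i64 {
    fnv1a_64(data) as i64
}
```

Hashing the empty input yields the FNV offset basis by construction, which makes a convenient sanity check in unit tests.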